Random Sampling over Joins Revisited

Authors

Zhuoyue Zhao

Robert Christensen

Feifei Li

Xiao Hu

Ke Yi

Abstract

Joins are expensive, especially on large data and/or multiple relations. One promising approach to mitigating their high cost is to return just a simple random sample of the full join results, which is sufficient for many tasks. Indeed, as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem: the sampling operator cannot be pushed through a join, i.e., sample(R ⋈ S) ≠ sample(R) ⋈ sample(S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins. This paper revisits this classic problem for both acyclic and cyclic multi-way joins. We build upon the idea of Chaudhuri et al., but extend it in several nontrivial directions. First, we propose a general framework for random sampling over multi-way joins, which includes the algorithm of Chaudhuri et al. as a special case. Second, we explore several ways to instantiate this framework, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput. We analyze the properties of different instantiations and evaluate them against the baseline methods; the results clearly demonstrate the superiority of our new techniques.
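The two-relation idea mentioned in the abstract (using precomputed statistics to guide sampling) can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `sample_join`, the key-extractor parameters, and the in-memory index are assumptions made for illustration. The sketch picks a tuple r from R with probability proportional to the number of S tuples it joins with, then picks one matching S tuple uniformly, so each pair in R ⋈ S is returned with equal probability.

```python
import random
from collections import defaultdict


def sample_join(R, S, key_r, key_s, k, rng=None):
    """Draw k uniform samples (with replacement) from the equi-join of R and S.

    Hypothetical helper for illustration only; names and signature are
    assumptions, not taken from the paper.
    """
    rng = rng or random.Random()

    # "Precomputed statistics": group S by join key so that both the
    # frequency of each key and the matching tuples are available.
    s_index = defaultdict(list)
    for s in S:
        s_index[key_s(s)].append(s)

    # Weight of each r = number of S tuples it joins with.
    weights = [len(s_index.get(key_r(r), ())) for r in R]
    if sum(weights) == 0:
        return []  # the join is empty, nothing to sample

    samples = []
    # Choose r proportionally to its weight, then a uniform partner in S.
    # P(r, s) = (w_r / |R join S|) * (1 / w_r) = 1 / |R join S|.
    for r in rng.choices(R, weights=weights, k=k):
        s = rng.choice(s_index[key_r(r)])
        samples.append((r, s))
    return samples


if __name__ == "__main__":
    R = [(1, "a"), (2, "b"), (3, "c")]
    S = [(1, "x"), (1, "y"), (2, "z")]
    first = lambda t: t[0]  # join on the first field of each tuple
    print(sample_join(R, S, first, first, k=5, rng=random.Random(0)))
```

By contrast, joining independent samples of R and S would only produce pairs whose two halves both happened to be sampled, which is why the sample operator cannot simply be pushed below the join.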


Similar resources

Perfect and Maximum Randomness in Stratified Sampling over Joins

Supporting sampling in the presence of joins is an important problem in data analysis. Pushing down the sampling operator through both sides of the join is inherently challenging due to data skew and correlation issues between output tuples. Joining simple random samples of base relations typically leads to results that are non-random. Current solutions to this problem perform biased sampling o...


Linked Bernoulli Synopses: Sampling along Foreign Keys

Random sampling is a popular technique for providing fast approximate query answers, especially in data warehouse environments. Compared to other types of synopses, random sampling bears the advantage of retaining the dataset’s dimensionality; it also associates probabilistic error bounds with the query results. Most of the available sampling techniques focus on table-level sampling, that is, t...


Selectivity Estimation for Joins Using Systematic Sampling

We propose a new approach to the estimation of join selectivity. The technique, which we have called “systematic sampling”, is a novel variant of the sampling-based approach. Systematic sampling works as follows: Given a relation R of N tuples, with a join attribute that can be accessed in ascending/descending order via an index, if n is the number of tuples to be sampled from R, select a tuple...


Memory-Limited Execution of Windowed Stream Joins

We address the problem of computing approximate answers to continuous sliding-window joins over data streams when the available memory may be insufficient to keep the entire join state. One approximation scenario is to provide a maximum subset of the result, with the objective of losing as few result tuples as possible. An alternative scenario is to provide a random sample of the join result, e...


An interactive framework for spatial joins: a statistical approach to data analysis in GIS

Many Geographic Information Systems (GIS) handle a large volume of geospatial data. Spatial joins over two or more geospatial datasets are very common operations in GIS for data analysis and decision support. However, evaluating spatial joins can be very time intensive due to the size of datasets. In this paper, we propose an interactive framework that provides faster approximate answers of spa...




Publication date: 2018
